
fix(ssh): add WebSocket ping/pong to per-session revdial connections#5947

Open
otavio wants to merge 5 commits into master from fix/revdial-session-keepalive

Conversation

@otavio
Member

@otavio otavio commented Mar 5, 2026

Summary

Fix idle SSH session disconnections caused by intermediaries (load balancers, NAT, reverse proxies) closing silent WebSocket connections. This PR addresses #5946 with per-session WebSocket keepalive and race condition fixes in the WebSocket adapter.

Changes

Per-session WebSocket ping/pong (pkg/revdial)

The V1 (revdial) per-session WebSocket connections had no server→agent keepalive. During idle SSH sessions, the only traffic was the agent's SSH keepalive (one-way, every KEEPALIVE_INTERVAL seconds). The server→agent direction was silent, causing intermediaries to detect a half-idle connection and close it.

Now Ping() is called on the agent-side adapter in grabConn(), sending WebSocket ping frames every 30s. The server automatically responds with pongs (gorilla/websocket handles this), creating bidirectional traffic that keeps the connection alive through all intermediaries.

WebSocket adapter Close() race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Close() had a race condition: concurrent callers (e.g., pong timeout AfterFunc firing while normal teardown runs) could panic by sending to an already-closed channel, or call conn.Close() multiple times. Now sync.Once ensures the entire cleanup — stopping the ping goroutine and closing the WebSocket connection — happens exactly once.

Subsequent Close() calls return the stored error from the first call rather than nil, preserving error semantics for callers that check the return value.
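A minimal sketch of this idempotent-Close pattern, with hypothetical field names (the real wsconnadapter struct differs): sync.Once guarantees the cleanup body runs exactly once, and because Once.Do establishes a happens-before edge, every caller can safely read the stored error afterwards.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

type adapter struct {
	closeOnce sync.Once
	closeErr  error         // error from the first (only) real close
	stopCh    chan struct{} // signals the ping goroutine to stop
}

func (a *adapter) Close() error {
	a.closeOnce.Do(func() {
		close(a.stopCh) // safe: executed exactly once, no double-close panic
		// Stand-in for conn.Close() on the underlying WebSocket.
		a.closeErr = errors.New("ws closed")
	})
	return a.closeErr // repeated calls return the stored error, not nil
}

func main() {
	a := &adapter{stopCh: make(chan struct{})}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // concurrent callers must not panic
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = a.Close()
		}()
	}
	wg.Wait()
	fmt.Println(a.Close()) // prints "ws closed"
}
```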

WebSocket adapter Ping() init race fix (pkg/wsconnadapter/wsconnadapter.go)

The old Ping() guarded against re-initialization with a nil check (if a.pongCh != nil), which is racy under concurrent calls — both callers see nil, both create channels and goroutines, leaking the first set. Now sync.Once guarantees initialization happens exactly once.

Community testing

| Tester | Setup | Result |
| --- | --- | --- |
| @zoopp | GCP Load Balancer, agent v0.21.4 | 4 sessions idle 30+ min — no disconnections |
| @ltan10 | External nginx proxy, self-compiled agent | 600 min session — no disconnections |

Note: @ltan10 also reported a separate issue with the main control connection dying at 20-27 min intervals. This is unrelated to per-session keepalive — the main connection already has four independent keepalive sources at 30s. Their issue points to external proxy behavior (max connection duration, connection pool limits).

Test plan

  • go build — all services compile
  • Integration tests pass in CI (both postgres and mongo)
  • @zoopp: tested, sessions stay alive 30+ minutes
  • @ltan10: tested, session stayed alive 600 minutes
  • Final re-test with updated branch (nginx change removed)

Fixes #5946

@gustavosbarreto
Member

I wonder if the Ping() would be better placed in grabConn() (agent-side) instead of ConnHandler() (server-side). Adding a goroutine per active session on the server doesn't scale. On the agent side each process only manages its own sessions, so the cost is distributed.

gorilla/websocket already responds to pings with pongs during NextReader(). No extra goroutine on the server, same bidirectional traffic for the LB.

```go
// pkg/revdial/revdial.go — grabConn(), line 442
c := wsconnadapter.New(wsConn)
c.Ping()

select {
case ln.connc <- c:
case <-ln.donec:
}
```

otavio force-pushed the fix/revdial-session-keepalive branch from 7031a95 to fe11347 on March 5, 2026 at 18:08
@otavio
Member Author

otavio commented Mar 5, 2026

@gustavosbarreto Good call — I've moved Ping() to grabConn() on the agent side as you suggested. This distributes the goroutine cost across agents instead of concentrating it on the server.

This fix works for the immediate issue, but we'll need to think more deeply about how we want to implement this long-term. Given that we've never had agent integration in other languages, we need to find a way to enable this back-and-forth communication that accounts for all clients. Some of our customers rely on a large number of devices, and in those cases this extra WebSocket ping/pong transmission on every active session could impact performance.

@otavio
Member Author

otavio commented Mar 5, 2026

Sharing some research on cloud load balancer behavior that reinforces why this keepalive fix is necessary.

All major cloud vendors enforce idle timeouts on load-balanced TCP/WebSocket connections, and some require application-level data (not just TCP keepalives) to reset the timer:

| Vendor | LB Type | Default Idle Timeout | TCP Keepalive Resets Timer? | App-Level Data Required? |
| --- | --- | --- | --- | --- |
| AWS | NLB | 350s | ✅ Yes | No |
| AWS | ALB | 60s | ❌ No | Yes |
| Azure | Standard LB | 4 min | ✅ Yes | Recommended |
| GCP | TCP Proxy | 30s | Conflicting docs | Yes |

AWS ALB and GCP TCP Proxy both require actual data in the payload to keep the connection alive — TCP-level keepalive packets aren't enough. WebSocket ping/pong frames count as application-level data, which is why our fix in #5947 works.

Ref: GCE discussion on TCP proxy LB timeouts

@otavio
Member Author

otavio commented Mar 6, 2026

@gustavosbarreto — requesting your review on this one. The changes go beyond the original per-session keepalive fix and touch shared infrastructure, so I want to make sure nothing has unintended side effects.

What to pay attention to

1. wsconnadapter.Close() is now idempotent

The old Close() called conn.Close() on the underlying gorilla/websocket connection every time it was invoked. The new version uses sync.Once — only the first call actually closes the connection; subsequent calls return nil.

This matters because several code paths can trigger Close() concurrently or multiple times:

  • The pong timeout time.AfterFunc fires and calls Close()
  • The revdial Dialer.close() calls conn.Close()
  • yamux session teardown calls Close() on the underlying connection

If any code path currently depends on getting an error back from a second Close() call (to detect "already closed"), it will now silently get nil. I reviewed the callers and didn't find such a case, but you know the platform better.

2. wsconnadapter.Ping() initialization is now atomic

The old nil-check guard (if a.pongCh != nil) was racy under concurrent calls. The new sync.Once ensures channels and the ping goroutine are created exactly once. Current callers only call Ping() once per adapter, so this is defensive — but worth verifying no future or cloud-specific code path calls it differently.

3. Gateway Nginx proxy_read_timeout 0 on all WebSocket locations

This disables Nginx's read timeout for /ssh/connection, /ssh/revdial, /agent/connection, and /ws. Application-level keepalive (30s ping/pong + JSON keep-alive) handles liveness instead.

The concern here is zombie connections: if the application-level keepalive ever stops working (bug, deadlock), Nginx won't clean up the connection — it will hang indefinitely. Previously, the 60s default acted as a safety net (albeit a fragile one). Worth considering whether we want a large finite value (e.g., proxy_read_timeout 3600) instead of 0 as a last-resort safety net.

In V1 (revdial) transport, the main control connection has bidirectional
keepalive (ping/pong + JSON keep-alive), but per-session WebSocket
connections created via dial-back have none. During idle SSH sessions,
the only traffic on the per-session WebSocket is the agent's SSH keepalive
(agent→server, one-way every 30s). The server→agent direction is
completely silent, causing intermediaries (load balancers, NAT, firewalls)
to detect a half-idle connection and close it.

Call Ping() on the agent-side wsconnadapter in grabConn() to start
sending WebSocket ping frames every 30 seconds. gorilla/websocket
automatically responds to pings with pong frames during NextReader(),
creating bidirectional traffic that keeps the connection alive through
all intermediaries. Placing this on the agent side distributes the
goroutine cost across agents rather than concentrating it on the server.

Fixes: #5946
otavio force-pushed the fix/revdial-session-keepalive branch 4 times, most recently from 00ee90c to aa1f041 on March 10, 2026 at 12:02
otavio added 2 commits March 10, 2026 09:36
Fix race condition in wsconnadapter Close() where concurrent callers
(e.g. pong timeout AfterFunc and normal teardown) could panic on
send-to-closed-channel or double-close the WebSocket connection. Use
sync.Once to guarantee both channel and connection cleanup happen
exactly once.
The previous nil-check guard on pongCh was racy: two concurrent callers
could both see nil and create duplicate channels and goroutines, leaking
the first set. Use sync.Once to guarantee initialization happens exactly
once, consistent with the Close() fix in the previous commit.
otavio force-pushed the fix/revdial-session-keepalive branch from aa1f041 to fc41136 on March 10, 2026 at 12:37
@otavio
Member Author

otavio commented Mar 10, 2026

@gustavosbarreto — the branch has been updated and all CI checks are now passing (unit tests + integration tests on both mongo and postgres).

The main change since your earlier review: the proxy_read_timeout 0 nginx change has been removed. During CI bisection we discovered that nginx interprets 0 as "zero seconds" (immediate timeout), not "infinite" — it was breaking all connections through the gateway. The application-level 30s ping/pong keepalive is sufficient to stay within nginx's default 60s timeout.

The remaining changes are:

  1. Per-session WebSocket ping/pong in grabConn() (agent-side, per your suggestion)
  2. Close() with sync.Once — idempotent, returns stored error on subsequent calls
  3. Ping() with sync.Once — atomic initialization
  4. Terminal resize test timeout increase (test flakiness fix)

Could you take another look when you get a chance?

The sync.Once Close() previously returned nil on repeated calls.
Store the error from the first close so callers always receive it.
otavio force-pushed the fix/revdial-session-keepalive branch from fc41136 to 18cad39 on March 10, 2026 at 16:15
The terminal_window_size_change test had a 2s per-attempt timeout
and 5s overall deadline for reading stty output over a PTY. Under
load, terminal I/O can exceed these tight limits, causing ~40%
flakiness locally. Increase to 10s per-attempt and 30s overall.

Verified with 20 consecutive runs (0 failures).